Skip to content

Conversation

@tiferet
Copy link
Contributor

@tiferet tiferet commented Nov 14, 2022

Implement the new query that selects data for training.

For now we include clauses that implement logic that is identical to the old queries, so that the final dataset is identical, but we also note as TODO items some constraints we may want to experiment with removing. We include a temporary wrapper query that converts the resulting data into the format expected by the endpoint pipeline.

This PR moves the couple of small pieces of ExtractEndpointData that are still needed into ExtractEndpointDataTraining.qll, and deletes ExtractEndpointDataTraining.ql, ExtractEndpointDataTraining.qll, and the associated test files.

Timing checks:

Closes https://github.com/github/ml-ql-adaptive-threat-modeling/issues/2098

Implement the new query that selects data for training. For now we include clauses that implement logic that is identical to the old queries.

Include a temporary wrapper query that converts the resulting data into the format expected by the endpoint pipeline.

Move the small pieces of `ExtractEndpointData` that are still needed into `ExtractEndpointDataTraining.qll`.
Also remove the associated test files.
@github-actions github-actions bot added the ATM label Nov 14, 2022
@tiferet tiferet marked this pull request as ready for review November 15, 2022 01:20
@tiferet tiferet requested review from a team and kaeluka and removed request for a team November 15, 2022 01:20
@tiferet
Copy link
Contributor Author

tiferet commented Nov 15, 2022

If the tests pass (they've been running for almost an hour now) and the KPI experiment is OK, then this PR will be ready for review when you get to work on Tuesday.

@tiferet
Copy link
Contributor Author

tiferet commented Nov 15, 2022

Nice -- the checks finally passed ✅

@tiferet
Copy link
Contributor Author

tiferet commented Nov 15, 2022

@kaeluka The KPI timing experiment is OK, so this PR is now ready for review 🏓

@owen-mc owen-mc changed the title Extract training data ATM: Extract training data Nov 15, 2022
kaeluka
kaeluka previously approved these changes Nov 15, 2022
Copy link

@kaeluka kaeluka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this looks good to me. I've left a few nitpicks (in other words: not necessary to address before merging — feel free to address in your next PR).

I've tried to use the suggestions feature where possible to make this as effortless as possible on your part.

The one choice you should make before merging is what to do about my suggestion to delete the logic backing up the hasFlowFromSource value. If you accept the suggestion, you'd need to also fix the .expected file. You may ignore that suggestion, merge and I'll send a PR tomorrow that removes the flag after you merge this PR.

exists(endpoint.getFile().getRelativePath()) and
// Only select endpoints that can be part of a tainted flow: Constant expressions always evaluate to a constant
// primitive value. Therefore they can't ever appear in an alert, making them less interesting training examples.
// TODO: Experiment with removing this requirement.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: Experiment with removing this requirement.
// TODO: Turn this requirement into a characteristic.

(nitpick)

..right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suppose we could. Would that characteristic have any (positive or negative) implications for any class? Or are you suggesting adding it as a characteristic with no implications, simply so that the modeling code could use it in type balancing?

Either way it would need to be done as an experiment, because it could impact ATM metrics, so we'd have to check whether it helps or hurts them.

@tiferet
Copy link
Contributor Author

tiferet commented Nov 15, 2022

Meta comment: In all the cases above, my goal in this PR was to reproduce the current data exactly, so that we won't need to do end-to-end testing before merging this PR. Once this basic framework is merged, the next step would be to open a PR that adjusts the logic in all the ways we think it should be adjusted, then run end-to-end testing. If metrics aren't hurt we could merge that PR and run a partial update process. Until the orchestrator is fully functional, that's a non-trivial process, so I'd rather not do it many times unnecessarily. That's why I prefer a strict separation between PRs that change the framework but leave the data identical, and can be verified with PR checks, followed by a small number of targeted PRs that improve the underlying logic/data but require end-to-end testing.

Co-authored-by: Stephan Brandauer <kaeluka@github.com>
Copy link

@kaeluka kaeluka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still lgtm 👍

@tiferet
Copy link
Contributor Author

tiferet commented Nov 15, 2022

Still lgtm 👍

Thank you! I had some followup questions about some of your comments (e.g. #11263 (comment)), so let's keep discussing them in this PR even after it's merged. When we reach a conclusion I'll implement it in a subsequent PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants